In November 2016, the world witnessed yet another important election in one of the major economic superpowers: the United States. The results left many surprised.
We obtained a dataset from Kaggle with 34 columns and 397,629 rows. Each row represents a tweet posted on election day, ordered from the earliest timestamp to the latest. A brief description of the features:
| Variable | Description |
|---|---|
| text | text of the tweet |
| created_at | date and time of the tweet (format yyyy-mm-dd hh:mm:ss) |
| geo | a JSON object containing coordinates [latitude, longitude] and a “type” |
| lang | Twitter’s guess as to the language of the tweet |
| place | a Place object from the Twitter API |
| coordinates | a JSON object containing coordinates [longitude, latitude] and a "type"; note that the coordinates are reversed relative to the geo field |
| user.favourites.count | number of tweets the user has favorited |
| user.statuses_count | number of statuses the user has posted |
| user.description | the text of the user’s profile description |
| user.location | text of the user’s profile location |
| user.id | unique id for the user |
| user.created_at | when the user created their account |
| user.verified | bool; is user verified? |
| user.following | bool; am I (Ed King - the data creator) following this user? |
| user.url | the URL that the user listed in their profile (not necessarily a link to their Twitter profile) |
| user.listed_count | number of lists this user is on (?) |
| user.followers_count | number of accounts that follow this user |
| user.default_profile_image | bool; does the user use the default profile pic? |
| user.utc_offset | positive or negative distance from UTC, in seconds |
| user.friends_count | number of accounts this user follows |
| user.default_profile | bool; does the user use the default profile? |
| user.name | user’s profile name |
| user.lang | user’s default language |
| user.screen_name | user’s account name |
| user.geo_enabled | bool; does user have geo enabled? |
| user.profile_background_color | user’s profile background color, as hex in format “RRGGBB” (no ‘#’) |
| user.profile_image_url | a link to the user’s profile pic |
| user.time_zone | full name of the user’s time zone |
| id | unique tweet ID |
| favorite_count | number of times the tweet has been favorited |
| retweeted | bool; is this a retweet? |
| source | if a link, where is it from (e.g., “Instagram”) |
| favorited | have I (Ed King - data creator) favorited this tweet? |
| retweet_count | number of times this tweet has been retweeted |
The dataset was scraped from Twitter by Ed King, a user with the user name poptimality, to uncover patterns from election day (November 8, 2016).
Reference: KDNuggets
For this group project, our general approach is to compare which methods work best on this dataset, using both the statistical learning methods we learned in class and common machine learning algorithms for analyzing large text data (the Lexicon-based Approach).
Some important packages that we may use for this project:

- tm: an R package for text mining and for analyzing corpus objects (finding associations, term frequencies, etc.).
- SentimentAnalysis: a popular R package for sentiment analysis, with built-in dictionaries (the Harvard-IV dictionary, Henry's Financial dictionary, and the Loughran-McDonald Financial dictionary) that map each word to a sentiment score.
- syuzhet, tidytext: supplementary R packages for sentiment analysis.
- caret: a popular machine-learning R package that provides a common interface to supervised and unsupervised methods.
- MASS: an R package providing classical statistical methods such as linear and quadratic discriminant analysis and ridge regression.
- stringr: an R package for regular-expression (regex) manipulation of character objects.
- plotly, ggplot2: R packages for visualization.
- magrittr: an R package that provides the pipe operator.
- Shiny: an R package for building interactive data explorations.
Naturally, we will also use some common R packages not listed above, such as dplyr, purrr, or readxl.

Our Plan
In sentiment analysis, the lexicon-based approach can be seen as a "nonparametric" statistical approach to meta- or unstructured data. Traditionally, the Dictionary-based Approach is considered the "easier" method in comparison to the Corpus-based Approach. The former first removes common stop words from the text, then assigns predefined "sentiment scores" (taken from English dictionaries) to sentiment-bearing words such as "angry" (score -3) and "happy" (score 3). The scores are then typically aggregated before the text is classified as "negative", "neutral", or "positive".
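As a minimal base-R sketch of this scoring idea (the project itself would use SentimentAnalysis or tidytext with a full dictionary; the tiny lexicon and stop-word list below are made up purely for illustration):

```r
# Tiny hand-made lexicon with AFINN-style scores; purely illustrative.
lexicon <- c(angry = -3, sad = -2, happy = 3, great = 3)
stop_words <- c("i", "am", "so", "the", "a", "an", "and")

score_text <- function(text) {
  # Lowercase, strip punctuation, tokenize on whitespace.
  tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  # Remove common stop words before scoring.
  tokens <- tokens[!tokens %in% stop_words]
  # Aggregate the scores of the words found in the lexicon.
  sum(lexicon[tokens], na.rm = TRUE)
}

classify <- function(score) {
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

classify(score_text("I am so happy and great!"))  # "positive" (score 3 + 3 = 6)
classify(score_text("I am so angry!"))            # "negative" (score -3)
```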
(Code and output omitted: using tidytext, stop words are removed with an anti_join on "word" and the remaining words are counted into a `word.count` table.)
Code Reference: Silge & Robinson, tidytext
The above approach uses the Dictionary-based Approach mentioned earlier to analyze the word counts of six Jane Austen novels.
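The tidytext code itself is not reproduced above, but the same count-words-after-removing-stop-words idea can be sketched in base R (the two lines of text below are a made-up stand-in for the novels, which the real example reads from the janeaustenr package):

```r
# Made-up stand-in text; the real example uses the full Jane Austen novels.
text <- c("it is a truth universally acknowledged",
          "a single man in possession of a good fortune")
stop_words <- c("it", "is", "a", "in", "of")

tokens <- unlist(strsplit(tolower(text), "\\s+"))
tokens <- tokens[!tokens %in% stop_words]  # base-R analogue of tidytext's anti_join
word.count <- sort(table(tokens), decreasing = TRUE)
word.count
```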
On the other hand, the Corpus-based Approach investigates the polarity of the text using machine learning methods. This method relies on the context or domain of the corpus to inform the sentiment labels of our text data.
(Output omitted: a log-log regression of term frequency on rank, with intercept 3.48 and slope -0.54.)
Code Reference: Embry, tm
One such method (from the tm package) produced the fit above. Zipf's law states that a word's frequency is inversely proportional to its frequency rank, so a log-log regression of frequency on rank should have a negative slope. The following two fits obey the same law.
(Output omitted: the two corresponding fits have intercepts 2.82 and 5.21, with slopes -0.48 and -0.73.)
Code Reference: Embry, tm
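A minimal base-R version of such a Zipf check (the fits above come from the tm workflow; the term frequencies below are made up) regresses log frequency on log rank:

```r
# Made-up term frequencies standing in for a real term-document matrix.
freqs <- sort(c(the = 120, to = 60, of = 41, vote = 30, trump = 25,
                clinton = 22, win = 15, night = 12, state = 10, usa = 8),
              decreasing = TRUE)

# Zipf's law: frequency is roughly proportional to rank^(-s), so
# log(frequency) should be approximately linear in log(rank).
fit <- lm(log(freqs) ~ log(seq_along(freqs)))
coef(fit)  # the slope estimates -s; natural text tends to give a slope near -1
```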
Here, from the context of the corpus data, we could attempt to assign the sentiment labels.
The statistical learning approach concerns assigning and verifying labels (along with other useful ideas such as dimensionality reduction, which could reduce computation time on large datasets such as ours). Our concern in this analysis is how to assign sentiment labels nonparametrically without the help of a labeled response.
Our first step might be to reduce the sheer dimensionality of the text column by performing a PCA. This will reduce computation time and mitigate the 'curse of dimensionality' by retaining only the most informative dimensions. A biplot might be useful for this:
Code Reference: Stackoverflow
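A base-R sketch of this step (the binary document-term matrix below is randomly generated, purely to illustrate the mechanics on data shaped like ours):

```r
# Random binary document-term matrix standing in for the encoded tweets:
# 20 "documents" by 10 "terms".
set.seed(1)
dtm <- matrix(rbinom(200, 1, 0.3), nrow = 20,
              dimnames = list(NULL, paste0("term", 1:10)))

# PCA of the matrix; the leading components capture most of the variance.
pca <- prcomp(dtm)

# Cumulative variance explained by the first three components.
summary(pca)$importance["Cumulative Proportion", 1:3]

# Scores and loadings on the first two components.
biplot(pca)
```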
One challenge here is that, because the text will be encoded as columns of binary values, calculating the dissimilarity matrices may be difficult. After choosing the dimensions of the data, we will attempt some common unsupervised and supervised learning methods to analyze the polarity of the texts, using the methods we learned in class.
After assigning a label to each tweet, we will use the labels to predict the 2016 election results, first by computing the frequencies of the sentiment labels by electoral district. We may encounter some difficulties and potential bias along the way, which we explain in the next section of our proposal.
For brevity and simplicity, we present the possible challenges of this project below.
Our dataset has 34 columns and 397,629 rows, with text being the feature of most interest. Because of this, data cleaning may be a problem, especially since the many stop words in the data may render some observations useless. Furthermore, upon further analysis, we found that only 338,331 tweets are in English, and we have not yet determined which of those originate in the United States (many users did not enable geolocation). Also, the dataset covers only election day itself (11/08/16), which may introduce bias and inaccuracy into the prediction. Since our ultimate goal is to predict the election results, and only United States citizens can vote, these issues may prevent our group from predicting them accurately, which will likely show up in our confusion matrix. However, in terms of depth of analysis, our project should still fare well.
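A base-R sketch of these filtering steps, on a toy stand-in for the real data (in practice the CSV would be loaded with read.csv, and the US heuristic would need to be far more thorough):

```r
# Toy stand-in for the real 397,629-row dataset.
tweets <- data.frame(
  text          = c("I voted!", "J'ai vote", "Great night", "Vote now"),
  lang          = c("en", "fr", "en", "en"),
  user.location = c("Ohio, USA", "Paris", "", "Texas"),
  stringsAsFactors = FALSE
)

# Step 1: keep only tweets Twitter tagged as English
# (338,331 of the 397,629 rows in the real data).
tweets_en <- subset(tweets, lang == "en")

# Step 2: since most users have geolocation disabled, fall back on a crude
# heuristic over the free-text profile location (illustrative, not exhaustive).
us_pattern <- "USA|United States|Texas|Ohio"
tweets_us <- subset(tweets_en, grepl(us_pattern, user.location))

nrow(tweets_us)  # 2 of the 3 English tweets in this toy example
```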
Since our dataset has many dimensions, visualizing it may be a problem. Because of this, our group decided to use the plotly package (whenever feasible), since it supports interactive 3D plots.
Code Reference: plotly
Also, to give our data visualizations a crisper, more modern look, we will use plotly and Shiny for better data exploration.
Using this dataset alone to predict the election results may not be realistic. After all, arguments that rely on sentiment alone often fall short. Because of this, we may use packages such as rvest or twitteR to scrape additional information, such as manifestos, state laws, and the sizes of constituencies. This will take a long time.